FooPar: A Functional Object Oriented Parallel Framework in Scala

Felix P. Hargreaves and Daniel Merkle
Department of Mathematics and Computer Science
University of Southern Denmark
{daniel,felhar07}@imada.sdu.dk

April 10, 2013

Abstract

We present FooPar, an extension for highly efficient parallel computing in the multi-paradigm
programming language Scala. Scala offers concise and clean syntax and integrates functional
programming features. Our framework FooPar elegantly combines these features with parallel
computing techniques. FooPar is highly modular and supports easy access to different
communication backends for distributed memory architectures as well as high performance math
libraries. In this article we use it to parallelize the Floyd-Warshall algorithm and
matrix-matrix multiplication. For the latter we show its scalability by a thorough isoefficiency
analysis. In addition, results based on a large-scale empirical analysis on two supercomputers
are given. We achieve close-to-optimal performance w.r.t. theoretical peak performance. Based on
this result we conclude that FooPar gives full access to Scala's design features without
suffering from performance drops when compared to implementations purely based on C and MPI.



1 Introduction 






FooPar is a data-structure-centric library more so than it is a framework. It targets the
following problems: deadlocks, race conditions, network programming and complexity of code.
Algorithms in FooPar work solely through group operations on collection classes, thus elimi- 
nating user interaction with message passing and synchronization issues. The benefit from this 
level of abstraction is that deadlocks and race conditions are practically eliminated. While the 
exclusion of explicit message passing might seem drastic, this paper will show that, not only 
can classical parallel algorithms be modeled within this framework, it also leads to concise, 
efficient and analyzable algorithms. 

FooPar differs from other functional parallel programming frameworks in some key aspects. 
Where many other frameworks like Haskell's Eden [T5] and Scala's Spark [TS] focus on workload
balancing, FooPar leaves this part to the user in order to make algorithms analyzable and
efficient and aims more at HPC applications.

FooPar uses the functional and object oriented programming language Scala [16] for many 
reasons. Scala is short for scalable language pointing towards its ability to make user defined 
abstractions seem like first class citizens in the language. The object oriented aspect leads to 
concise and readable syntax when combined with operator overloading, e.g. in matrix opera- 
tions. Another benefit of Scala is the underlying platform, the Java Virtual Machine (JVM). 
JVM is a mature platform available for all relevant architectures. It hosts a plethora of modern 
programming languages allowing for complete freedom across paradigms and platforms. Efficiency
of byte-code can approach that of optimized C-implementations within small constants
[H]. Many of the languages on JVM are managed and utilize garbage collection. While this can
cause performance penalties, they are easily outweighed by the benefits of memory abstraction 
and memory integrity. 

A performance boost can be gained by using the Java Native Interface (JNI); however, this adds
an additional linear amount of work due to memory being copied between the virtual machine and
the native program. In other words, superlinear workloads motivate the usage of JNI.

In this paper (after definitions and a brief introduction to isoefficiency in Section 2) we will
introduce FooPar in Section 3 and describe its architecture, its data structures, and the
operations it contains. The complexity of the individual operations on the (parallel) data
structures will be shown to serve as basis for parallel complexity analysis. A matrix-matrix
multiplication algorithm will be designed using the functionality of FooPar; two
implementations will be analyzed with an isoefficiency analysis in Section 4. In Section 5 a
parallel implementation of the Floyd-Warshall algorithm will be presented. Test results showing
that FooPar can reach close-to theoretical peak performance on large supercomputers will be
presented in Section 6. We conclude with Section 7.

2 Definitions, Notations, and Isoefficiency 

The most widespread model for scalability analysis of heterogeneous parallel systems (i.e. the
parallel algorithm and the parallel architecture) is isoefficiency [UJ [6] analysis. The
isoefficiency function for a parallel system relates the problem size $W$ and the number of
processors $p$ and defines how large the problem size as a function in $p$ has to grow in order
to achieve a constant pre-given efficiency. Isoefficiency has been applied to a wide range of
parallel systems (see, e.g. [5], [TU], [2]). As usual, we will define the message passing costs,
$t_c$, for parallel machines as $t_c := t_s + t_w \cdot m$, where $t_s$ is the start-up time,
$t_w$ is the per-word transfer time, and $m$ is the message size. The sequential (resp.
parallel) runtime will be denoted as $T_S$ (resp. $T_P$). The problem size $W$ is identical to
the sequential runtime, i.e. $W := T_S$. The overhead function will be defined as
$T_o(W,p) := p T_P - T_S$. The isoefficiency function for a parallel system is usually found by
an algebraic reformulation of the equation

$$W = K \cdot T_o(W,p)$$

such that $W$ is a function in $p$ only (see e.g. [6] for more details). In this paper we will
employ broadcast and reduction operations for isoefficiency analysis for parallel matrix-matrix
multiplication with FooPar. Assuming a constant cross-section bandwidth of the underlying
network and employing recursive doubling leads to a one-to-all broadcast runtime of
$(t_s + t_w \cdot m) \log p$ and the identical runtime for an all-to-one reduction with any
associative operation $\lambda$. All-to-all broadcast and reduction have a runtime of
$t_s \log p + t_w \cdot m (p - 1)$. A circular shift can be done in runtime $t_s + t_w \cdot m$
if the underlying network has a cross-section bandwidth of $O(p)$.

Figure 1: Conceptual overview of the layered architecture of FooPar.

A parallel system is cost-optimal if the processor-time product has the same asymptotic
growth as the runtime of the sequential algorithm, i.e. $p \cdot T_P \in \Theta(T_S)$.
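The cost expressions of this section can be collected in a small model. The following is only a sketch in plain Scala, not part of FooPar: the names `CostModel` and `Machine` are hypothetical, the logarithm is assumed to be base 2, and `efficiency` uses the standard definition $E = T_S / (p \cdot T_P)$, which the text does not spell out.

```scala
// Sketch of the communication cost model; all names here are hypothetical.
object CostModel {
  // tS: start-up time, tW: per-word transfer time (machine parameters)
  final case class Machine(tS: Double, tW: Double)

  private def log2(x: Double): Double = math.log(x) / math.log(2.0)

  // One-to-all broadcast or all-to-one reduction: (t_s + t_w m) log p
  def broadcast(mc: Machine, m: Double, p: Int): Double =
    (mc.tS + mc.tW * m) * log2(p)

  // All-to-all broadcast: t_s log p + t_w m (p - 1)
  def allToAll(mc: Machine, m: Double, p: Int): Double =
    mc.tS * log2(p) + mc.tW * m * (p - 1)

  // Circular shift: t_s + t_w m
  def shift(mc: Machine, m: Double): Double = mc.tS + mc.tW * m

  // Overhead function T_o(W, p) = p T_P - T_S
  def overhead(tP: Double, tSeq: Double, p: Int): Double = p * tP - tSeq

  // Efficiency E = T_S / (p T_P) (standard definition, assumed here)
  def efficiency(tP: Double, tSeq: Double, p: Int): Double = tSeq / (p * tP)
}
```

A system is cost-optimal in the sense above exactly when `efficiency` stays bounded away from zero as `p` and the problem size grow together.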



3 The FooPar Framework 



FooPar is a highly modular extension to Scala which supports user extensions and additions
to data structures with reference design patterns. Fig. 1 depicts the architecture of FooPar.

Using the builder/traversable pattern [15], one can create maintainable distributed collection
classes while benefiting from the underlying modular communication layer. In turn, this means
that user provided data structures receive the same benefits from the remaining layers of the
framework as the ones that ship with FooPar. It is, however, entirely possible to design a
vast array of algorithms using purely the data structures within FooPar.

A configuration of FooPar can be described as FooPar-X-Y-Z, where X is the communication
module, Y is the native code used for networking and Z is the hardware configuration, e.g.
$X \in$ {MPJ-Express, OpenMPI, FastMPJ, SharedMemory}, $Y \in$ {MPI, Sockets} and
$Z \in$ {SharedMemory, Cluster, Cloud}. Note that this is not an exhaustive listing of module
possibilities. In this paper we only use Y=MPI and Z=Cluster and do not analyze Shared
Memory parallelisation. Therefore, we will only use the notation FooPar-X.

Figure 2: A distributed map operation.

Figure 3: Output of the distributed map operation (arbitrary order).

3.1 Technologies 

Currently, FooPar uses the newest version of Scala 2.10 which gives access to experimental
reflection support. This is currently used to allow user defined serializers for collections with a
fallback to Java's byte serializer. The ScalaCheck framework is a specification testing framework
for Scala which is used to test the methods provided by FooPar data structures. JBLAS, a
high performing linear algebra library [1] using BLAS via JNI, is used to benchmark FooPar
with an implementation of distributed matrix-matrix multiplication. Intel's Math Kernel
Library (MKL) offers a high-performing alternative with Java bindings, and will also be used for
benchmarking.

3.2 SPMD Operations on Distributed Sequences 

FooPar is inspired by the SPMD/SIMD principle often seen in parallel hardware [3]. The
Option monad in Scala is a construct similar to Haskell's Maybe monad. Option is especially
suited for SPMD patterns since it supports map and foreach operations. The following
exemplifies this characteristic approach in FooPar:

    def ones(i: Int): Int = i.toBinaryString.count(_ == '1')
    val seq = 0 to worldSize - 3
    val counts = seq mapD ones
    println(globalRank + ": " + counts)

Here, ones(i) counts the number of 1's in the binary representation of i. mapD distributes the
map operation on the Scala range seq.

In SPMD, every process runs the same program, i.e. every process evaluates the definition of
seq. If combined with lazy-data objects, this does not lead to unnecessary space or complexity
overhead (cf. Fig. 2 and 3). While every process generates the sequence, only some processes
perform the mapD operation.
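The SPMD behaviour described above can be imitated without FooPar or MPI by looping over simulated ranks. The sketch below is an illustration under assumptions: `mapLocal` is a hypothetical stand-in for mapD, and returning `None` on ranks that hold no element is an assumed reading of how Option is used, not FooPar's documented behaviour.

```scala
// Plain-Scala simulation of the SPMD pattern; mapLocal is a hypothetical stand-in for mapD.
object SpmdSketch {
  // Number of 1s in the binary representation of i (same helper as in the listing).
  def ones(i: Int): Int = i.toBinaryString.count(_ == '1')

  // The process with the given rank transforms only "its" element;
  // ranks beyond the end of the sequence hold nothing (assumed semantics).
  def mapLocal[A, B](rank: Int, seq: IndexedSeq[A])(f: A => B): Option[B] =
    if (rank < seq.length) Some(f(seq(rank))) else None

  def main(args: Array[String]): Unit = {
    val worldSize = 5 // assumed number of processes
    val seq = 0 to worldSize - 3 // every simulated process builds the same range
    for (rank <- 0 until worldSize) {
      val counts = mapLocal(rank, seq)(ones)
      println(s"$rank: $counts")
    }
  }
}
```

Because a Scala `Range` stores only its bounds, building `seq` on every rank costs constant space, which is the point made about lazy-data objects.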



3.3 Data Structures 

FooPar relies heavily on the interpretation of data structures as process-data mappings. As 
opposed to many modern parallel programming tools, FooPar uses static mappings defined by 
the data structures and relies on the user to partition input. This decision was made to ensure 
efficiency and analyzability. By using static mappings in conjunction with SPMD, the overhead 
and bottleneck pitfalls induced by master-slave models are avoided and program simplicity and
efficiency are achieved. In FooPar, data partitioning is achieved through proxy or lazy objects,
which are easily defined in Scala. In its current state, FooPar supports distributed singletons
(a.k.a. distributed variables), distributed sequences and distributed multidimensional sequences.
The distributed sequence combines the notion of communication groups and data. By allowing
the dynamic creation of communication groups for sequences, a total abstraction of network
communication is achieved. Furthermore, a communication group follows data structures for
subsequent operations allowing for advanced chained functional programming to be highly
parallelized. Table 1 lists the currently supported operations on distributed sequences. The given
runtimes are actually achieved in FooPar, but of course they depend on the implementation
of collective operations in the communication backend. A great advantage of excluding user
defined message passing is gaining analyzability through the provided data structures.

4 Matrix-Matrix Multiplication in FooPar

4.1 Serial Matrix-Matrix Multiplication 

Due to the abstraction level provided by the framework, algorithms can be defined in a fashion 
which is often very similar to a mathematical definition. Matrix-matrix multiplication is a good 
example of this. The problem can be defined as follows: 

$$(AB)_{i,j} := \sum_{k=0}^{n-1} A_{i,k} B_{k,j}$$

where $n$ is the number of columns of $A$ and the number of rows of $B$. In functional
programming, list operations can be used to model this expression in a concise manner. The
three methods zip, map and reduce are enough to express matrix-matrix multiplication as
a functional program. A serial algorithm for matrix-matrix multiplication based on a 2D
decomposition of the matrices could look like this:

$$C_{ij} \leftarrow \mathrm{reduce}\;(+)\;(\mathrm{zipWith}\;(\cdot)\;A_{i*}\;B_{*j}), \quad \forall (i,j) \in \mathcal{R} \times \mathcal{R}$$

Here, $\mathcal{R} = \{0, \ldots, q-1\}$, and the sub-matrices are of size $(n/q)^2$. Operation
zipWith is a convenience method roughly equivalent to map ∘ zip, which takes two lists and a
binary function to combine them.
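The zip/map/reduce formulation can be tried out directly with Scala's standard collections. The following minimal sketch uses scalar entries instead of sub-matrices (i.e. block size 1); `SerialMatMul` and its `zipWith` helper are illustrative names and not FooPar API.

```scala
// Serial matrix-matrix multiplication from zip, map and reduce only.
object SerialMatMul {
  type Matrix = Vector[Vector[Double]]

  // zipWith = map o zip: combine two lists element-wise with a binary function
  def zipWith[A, B, C](xs: Seq[A], ys: Seq[B])(f: (A, B) => C): Seq[C] =
    xs.zip(ys).map { case (x, y) => f(x, y) }

  // C(i)(j) = reduce (+) (zipWith (*) row_i(A) col_j(B))
  def multiply(a: Matrix, b: Matrix): Matrix = {
    val bt = b.transpose // columns of b become rows
    a.map(row => bt.map(col => zipWith(row, col)(_ * _).reduce(_ + _)))
  }
}
```

For example, `SerialMatMul.multiply(Vector(Vector(1.0, 2.0), Vector(3.0, 4.0)), Vector(Vector(5.0, 6.0), Vector(7.0, 8.0)))` yields the rows (19, 22) and (43, 50).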



| Operation | Semantics | Notes | $T_P$ (parallel runtime) |
|---|---|---|---|
| mapD($\lambda$) | Each process transforms one element of the sequence using operation $\lambda$ (element size $m$) | This is a non-communicating operation | $\Theta(T_\lambda(m))$ |
| zipWithD($\lambda$, $a$) | Combines this sequence with sequence $a$ using operator $\lambda$ (element size $m$) | $a$ must be of the same size as the callee; this is a non-communicating operation | $\Theta(T_\lambda(m))$ |
| reduceD($\lambda$) | The sequence with $p$ elements is reduced to the root process using operation $\lambda$ | $\lambda$ must be an associative operator | $\Theta(\log p\,(t_s + t_w m + T_\lambda(m)))$ |
| shiftD($\delta$) | The sequence (element size $m$) is shifted cyclically by $\delta$ elements | | $\Theta(t_s + t_w m)$ |
| allToAllD | Process $i$ sends the $j$th element to process $j$ | | $\Theta(t_s \log p + t_w m (p-1))$ |
| allGatherD | All processes obtain a list where element $i$ comes from process $i$ | Process $i$ provides the valid $i$th element | $\Theta((t_s + t_w m)(p-1))$ |
| apply($i$) | All processes obtain the $i$th element of the sequence | Semantically identical to a one-to-all broadcast | $\Theta(\log p\,(t_s + t_w m))$ |

Table 1: Standard operations on distributed sequences in FooPar.

4.2 Generic Algorithm for Parallel Matrix-Matrix Multiplication 

To illustrate the simplicity of complexity analysis, the parallel version of the algorithm can be
written in a more verbose fashion as follows:

$$C_{ij} \leftarrow \mathrm{reduceD}\;(+)\;(\mathrm{mapD}\;(\cdot)\;(\mathrm{zip}\;A_{i*}\;B_{*j})), \quad \forall (i,j) \in \mathcal{R} \times \mathcal{R}$$

Operation zip is in $\Theta(1)$ due to lazy evaluation. We use a block size $m = (n/q)^2$. For mapD
(multiplication of sub-matrices) we have $T_{mult}(m) = \Theta(m^{3/2})$, for reduceD (summation of
sub-matrices) we have $T_{sum}(m) = \Theta(m)$. In asymptotic terms the parallel runtime $T_P$ is
therefore:

$$T_P = \underbrace{\Theta(1)}_{\text{zip}} + \underbrace{\Theta((n/q)^3)}_{\text{mapD}} + \underbrace{\Theta((n/q)^2 \log q)}_{\text{reduceD}}$$

    /*
     * Initialize matrices
     */
    val A = Array.fill(M, M)(MJBLProxy(SEED, b))
    val Bt = Array.fill(M, M)(MJBLProxy(SEED, b)).transpose

    /*
     * Multiply matrices
     */
    for (i <- 0 until M; j <- 0 until M)
      A(i) zip Bt(j) mapD { case (a, b) => a * b } reduceD (_ + _)

Algorithm 1: Generic algorithm for matrix-matrix multiplication with FooPar.

Since $C_{ij}$ is independent both in $i$ and $j$, the $q^2$ operations can all run in parallel.
Using $q$ processors per reduction then leads to $p = q^2 \cdot q$ processors and the overall
asymptotic runtime $\Theta((n/q)^3 + (n/q)^2 \log q)$ with $q = p^{1/3}$.

Using the framework, some parts of the analysis can be carried out independently of the
lambda operations used in an algorithm. What is left is a generic algorithm which shows
precisely the communication pattern used in the algorithm. As a coincidence, the communication
pattern is essentially identical to that of the well known DNS algorithm [I], [8].

Algorithm 1 shows a complete FooPar implementation, which is practically identical to the
pseudo code. Note that the algorithm uses proxy objects, which are simply objects containing
lazy data using Scala's lazy construct.
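How such a proxy object can defer its data is easiest to see with a `lazy val`. The class below is a hypothetical illustration and not the MJBLProxy used in Algorithm 1: `BlockProxy`, its seed-based generation and the `generated` flag are assumptions made for the demonstration.

```scala
// Hypothetical proxy object: the block is generated only on first access to `data`.
object LazyProxySketch {
  final class BlockProxy(seed: Long, size: Int) {
    var generated = false // demonstration only: records whether data was materialized

    lazy val data: Array[Array[Double]] = {
      generated = true
      val rnd = new scala.util.Random(seed)
      Array.fill(size, size)(rnd.nextDouble())
    }
  }

  def main(args: Array[String]): Unit = {
    // Creating the grid of proxies is cheap: no block data exists yet.
    val grid = Array.fill(4, 4)(new BlockProxy(42L, 100))
    val block = grid(1)(2).data // forces exactly one block
    println(grid.flatten.count(_.generated))
  }
}
```

Under SPMD every process can therefore build the full array of proxies, while only the blocks a process actually touches are ever materialized.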



4.2.1 Isoefficiency Analysis for the Generic Algorithm

We start by determining the non-asymptotic parallel runtime. We assume the number of
processors is $p = q^3$ (i.e. $q = p^{1/3}$) and matrices $A$ and $B$ of size $n \times n$.
Splitting $A$ and $B$ into $q \times q$ blocks leads to a block size of $(n/q)^2$. The zip
operation has a runtime of $q^2$ due to nop instructions carried out in iterations where the
current process is not assigned to the operation. An implicit conversion (runtime $q^2$) is
needed to extend the functionality of standard Scala arrays. The mapD operation has a runtime
of $q^2 + (n/q)^3$ and the reduceD operation has a runtime of $q^2 + \log q + (n/q)^2 \log q$.
As $q^2 = p^{2/3}$, this leads to an overall parallel runtime of

$$T_P = 4 p^{2/3} + \frac{n^3}{p} + \frac{1}{3}\left(\log p + \left(\frac{n}{p^{1/3}}\right)^2 \log p\right),$$

and the corresponding cost $p \cdot T_P \in \Theta(4p^{5/3} + n^3)$. Therefore this approach is
cost-optimal for $p \in O(n^{9/5})$. The overhead for this basic implementation is

$$T_o = p T_P - T_S = 4p^{5/3} + \frac{p}{3}\left(\log p + \left(\frac{n}{p^{1/3}}\right)^2 \log p\right).$$

Figure 4: a) Process $(i,j,k)$ contains blocks $A_{ik}$ and $B_{kj}$, b) local multiplication
$C_{ij} = A_{ik} \times B_{kj}$, c) after reduction (summation): process $(i,j,0)$ contains the
(partial) result matrix.
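The non-asymptotic runtime of the generic algorithm can be evaluated numerically to see how efficiency reacts to $n$ and $p$. This is a rough sketch under assumptions: all unit costs are taken as 1, the logarithm is base 2, $T_S = n^3$, and `GenericAlgorithmModel` is a hypothetical name.

```scala
// Numerical model of the generic algorithm's runtime and efficiency (assumed unit costs).
object GenericAlgorithmModel {
  private def log2(x: Double): Double = math.log(x) / math.log(2.0)

  // T_P = 4 p^(2/3) + n^3 / p + (1/3) (log p + (n / p^(1/3))^2 log p)
  def parallelRuntime(n: Double, p: Double): Double = {
    val q = math.cbrt(p) // q = p^(1/3)
    4 * q * q + n * n * n / p + (log2(p) + (n / q) * (n / q) * log2(p)) / 3
  }

  // E = T_S / (p T_P) with T_S = W = n^3
  def efficiency(n: Double, p: Double): Double =
    n * n * n / (p * parallelRuntime(n, p))
}
```

For a fixed $p$ the modelled efficiency rises with $n$, and for a fixed $n$ it falls as $p$ grows, which is the trade-off the isoefficiency function quantifies.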



    val R = 0 until DIM
    val G = Grid3D(R, R, R)
    val GA = G mapD { case (i, j, k) => A(i)(k) }
    val GB = G mapD { case (i, j, k) => B(k)(j) }
    val C = ((GA zipWithD GB) (_ * _) zSeq) reduceD (_ + _)

Algorithm 2: Matrix-matrix multiplication in FooPar using Grid Abstraction.

Following an isoefficiency analysis based on $W = K \cdot T_o(W,p)$ leads to

$$W = n^3 = K 4 p^{5/3} + K \frac{p}{3}\left(\log p + \left(\frac{n}{p^{1/3}}\right)^2 \log p\right).$$

Examining the terms individually shows that the first term of $K \cdot T_o(W,p)$ constrains the
scalability the most. Therefore, the isoefficiency function for the basic algorithm is
$W \in \Theta(p^{5/3})$. Fig. 4 shows the communication pattern implemented by Algorithm 1.
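The term-by-term examination can be spelled out as a short worked derivation. It is a sketch in which each term of $K \cdot T_o(W,p)$ is balanced against $W = n^3$ on its own, with constant factors kept only where they are needed:

```latex
\begin{align*}
W &= 4K\,p^{5/3}
  &&\Rightarrow\; W \in \Theta(p^{5/3}),\\
W &= \tfrac{K}{3}\,p \log p
  &&\Rightarrow\; W \in \Theta(p \log p),\\
n^3 &= \tfrac{K}{3}\,p \left(\frac{n}{p^{1/3}}\right)^{2} \log p
  \;\Rightarrow\; n = \tfrac{K}{3}\,p^{1/3} \log p
  &&\Rightarrow\; W = n^3 \in \Theta(p \log^3 p).
\end{align*}
```

Since $p^{5/3}$ grows faster than both $p \log p$ and $p \log^3 p$, the first term dictates how fast the problem size has to grow.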



4.3 Grid Abstraction in FooPar for Parallel Matrix-Matrix Multiplication

In [7] an isoefficiency function in the order of $\Theta(p \log^3 p)$ was achieved by using the DNS
algorithm for matrix-matrix multiplication. The bottleneck encountered in the basic
implementation is due to the inherently sequential for loop emulating the $\forall$ quantifier.
Though Scala offers a lot of support for library-as-DSL like patterns, there is no clear way to
offer safe parallelisation of nested for loops while still supporting distributed operations on
data structures. To combat this problem, FooPar supports multidimensional distributed sequences
in conjunction with constructors for arbitrary Cartesian grids. Grid3D is a special case of GridN,
which supports iterating over 3D-tuples as opposed to coordinate lists. Using Grid3D an
algorithm for matrix-matrix multiplication can be implemented as seen in Algorithm 2. zSeq is
a convenience method for getting the distributed sequence, which is variable in z and constant
in the x, y coordinates of the current process. By using the grid data structure, we safely
eliminate the overhead induced by the for-loop in Algorithm 1 and end up with the same basic
communication pattern as shown in Fig. 4. Operation mapD has a runtime of $\Theta((n/q)^3)$ and
reduceD a runtime of $\Theta(\log q + (n/q)^2 \log p)$. Due to space limitations we will not present
the details of runtime and isoefficiency analysis but refer to [8], as the analysis given there is
very similar. Parallel runtime, $T_P$, and cost are given by
$T_P = n^3/p + \log p + (n^2/p^{2/3}) \log p$ and
cost $\in \Theta(n^3 + p \log p + n^2 p^{1/3} \log p)$. This leads to an isoefficiency function in
the order of $\Theta(p \log^3 p)$, identical to the isoefficiency achieved by the DNS algorithm.
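The grid-based communication pattern can be imitated sequentially to check that it computes the matrix product. The sketch below is plain Scala rather than FooPar: it assumes square matrices of equal size, uses scalar blocks (block size 1, so $q = n$), and `GridSketch` is a hypothetical name. Each simulated process $(i,j,k)$ holds one local product, and the sum over $k$ plays the role of zSeq followed by reduceD.

```scala
// Sequential simulation of the 3D-grid pattern with scalar blocks.
object GridSketch {
  type Matrix = Vector[Vector[Double]]

  def multiply(a: Matrix, b: Matrix): Matrix = {
    val r = a.indices
    // One local product per simulated process (i, j, k): A(i)(k) * B(k)(j)
    val local: Map[(Int, Int, Int), Double] =
      (for (i <- r; j <- r; k <- r) yield ((i, j, k) -> a(i)(k) * b(k)(j))).toMap
    // Reduction along z (k) for fixed (i, j)
    Vector.tabulate(r.size, r.size) { (i, j) =>
      r.map(k => local((i, j, k))).reduce(_ + _)
    }
  }
}
```

All $q^3$ local products are independent of each other, which is why the grid version needs no sequential loop over $(i,j)$.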

5 Floyd-Warshall Algorithm

The Floyd-Warshall algorithm solves the all-pairs shortest path problem between all nodes
$v_1, \ldots, v_n$ in a weighted graph. Let $d^k_{i,j}$ be the weight of the minimum-weight path
between $v_i$ and $v_j$ whose intermediate vertices all lie in the set $\{v_1, \ldots, v_k\}$. The
weight of an edge between nodes $v_i$ and $v_j$ is denoted as $w(v_i, v_j)$. The dynamic
programming formulation can be expressed as:

$$d^k_{i,j} = \begin{cases} w(v_i, v_j), & k = 0 \\ \min\{d^{k-1}_{i,j},\; d^{k-1}_{i,k} + d^{k-1}_{k,j}\}, & k \geq 1 \end{cases}$$

The weight of the shortest path from $v_i$ to $v_j$ is then given by $d^n_{i,j}$. We follow the
parallelisation approach from [12] and present a scalable version of the parallel Floyd-Warshall
algorithm that employs the 2D Grid Abstraction (size $p = q^2$) in FooPar. In Algorithm 3 its
parallel implementation is given. Very briefly explained, lines 1-3 initialize the 2D grid and
line 5 is the inherently sequential loop of the algorithm, which is safely modeled as a standard
for loop. Line 6 gets the $(k \bmod B)$'th row of the $\lfloor k/B \rfloor$'th block in the column
of the calling process. Similarly, line 7 gets the $(k \bmod B)$'th column of the
$\lfloor k/B \rfloor$'th block in the row of the calling process. Lines 9-14 transform the grid
into the next iteration by updating each block in parallel. While the outer loop has $n$
iterations, lines 6 and 7 run in

$$T_P = \underbrace{\Theta(B)}_{\text{mapD}} + \underbrace{\Theta((t_s + t_w B) \log q)}_{\text{apply}}.$$

Lines 9-14 run in $\Theta(B^2)$. Since $B = n/\sqrt{p}$ and $q = \sqrt{p}$, the total parallel
runtime is

$$T_P = \Theta\left(n \left(\frac{n}{\sqrt{p}} + \left(t_s + t_w \frac{n}{\sqrt{p}}\right) \log \sqrt{p} + \frac{n^2}{p}\right)\right),$$

leading to a scalable algorithm with isoefficiency function in the order of
$\Theta((\sqrt{p} \log p)^3)$.
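For reference, the dynamic programming formulation can be written as a short serial routine. This is a sketch of the textbook sequential algorithm, not of the parallel FooPar version; `FloydWarshallSketch` is a hypothetical name, and missing edges are assumed to be encoded as `Double.PositiveInfinity` with zeros on the diagonal.

```scala
// Serial Floyd-Warshall, the sequential baseline for the parallel version.
object FloydWarshallSketch {
  // w(i)(j): edge weight, Double.PositiveInfinity if there is no edge, 0 on the diagonal
  def shortestPaths(w: Array[Array[Double]]): Array[Array[Double]] = {
    val n = w.length
    val d = w.map(_.clone) // d^0 = w, copied so the input stays untouched
    for (k <- 0 until n; i <- 0 until n; j <- 0 until n)
      d(i)(j) = math.min(d(i)(j), d(i)(k) + d(k)(j)) // step from d^(k-1) to d^k
    d
  }
}
```

The loop over `k` is the inherently sequential part, while the updates over `i` and `j` for a fixed `k` are what the grid version distributes across blocks.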

6 Test Results 

Due to space limitations we only present results for matrix-matrix multiplication. 

     1  val R = 0 until BLOCKS.size  // range for all q=sqrt(p) "BLOCKS" per dim
     2  val BR = 0 until B           // B = n/q is the size of a BLOCK
     3  var grid = GridN(R, R) mapD { case i :: j :: Nil => BLOCKS(i)(j) }
     4
     5  for (k <- 0 until n) {
     6    val ik = grid.xSeq.mapD(_(k % B)).apply(k / B)
     7    val kj = grid.ySeq.mapD(_.map(_(k % B))).apply(k / B)
     8
     9    grid = grid.mapD { block =>
    10      for (i <- BR; j <- BR) {
    11        block(i)(j) = math.min(block(i)(j), ik(j) + kj(i))
    12      }
    13      block
    14    }
    15  }

Algorithm 3: Implementation of the parallel Floyd-Warshall algorithm in FooPar.

Parallel Systems and their Interconnection Framework: In this study we focus on
analyzing scalability, efficiency and flexibility. We tested FooPar on two parallel systems: the
first system is called Carver and is used to analyze the peak performance and the overhead
of FooPar. It is an IBM iDataPlex system where each computing node consists of two Intel
Nehalem quad-core processors (2.67 GHz processors, each node has at least 24GB of RAM). The
system is located at the Department of Energy's National Energy Research Scientific Computing
Center (NERSC). All nodes are interconnected by 4X QDR InfiniBand technology, providing
a maximum of 32 Gb/s of point-to-point bandwidth. A highly optimized version of Intel's Math
Kernel Library (MKL) is used, which provides an empirical peak performance of 10.11 GFlop/s
on one core (based on a single core matrix-matrix multiplication in C using MKL). This will
be our reference performance to determine efficiency on Carver. Note that the empirical peak
performance is very close to the theoretical peak performance of 10.67 GFlop/s on one core.
The largest parallel job in Carver's queuing system can use at most 512 cores, i.e. the
theoretical peak is 5.46 TFlop/s.

The second system has basically the same hardware setup. The name of the system is 
Horseshoe-6 and it is located at the University of Southern Denmark. Horseshoe-6 is used in 
order to test the flexibility of FooPar. The math libraries are not compiled towards the node's 
architecture, but a standard high performing BLAS library was employed for linear algebraic 
operations. The reference performance on one core was measured again by a matrix-matrix 
multiplication (C-version using BLAS) and is 4.55 GFlop/s per core. 

On Carver, Java bindings of the nightly-build OpenMPI version 1.9a1r27897 [3] were used in
order to interface to OpenMPI (these Java bindings are not yet available in the stable version
of OpenMPI). On Horseshoe-6 we used three different communication backends, namely i.)
OpenMPI Java bindings (same version as on Carver), ii.) MPJ-Express [17], and iii.) FastMPJ
[18]. Note that changing the communication backend does not require any change in the Scala
source code for the parallel algorithmic development within FooPar.

For performance comparison of FooPar and C we also developed a highly optimized parallel
version of the DNS algorithm for matrix-matrix multiplication, using C/MPI. MKL (resp.
BLAS) was used on Carver (resp. Horseshoe-6) for the sub-matrix-matrix multiplication on
the individual cores. Note that the given efficiency results basically do not suffer any noticeable
fluctuations when repeated.

Results on Carver: Efficiencies for different matrix sizes, $n$, and number of cores, $p$, are
given in Fig. 5. As communication backend, we used OpenMPI. We note that we improved
the Java implementation of MPI.Reduce in OpenMPI: the nightly build version implements an
unnecessarily simplistic reduction with $\Theta(p)$ send/receive calls, although this can be realized
with $\Theta(\log p)$ calls. I.e., the unmodified OpenMPI does not interface to the native MPI_Reduce
function, and therefore introduces an unnecessary bottleneck.
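The difference between the two reduction strategies can be illustrated by counting sequential steps. The sketch below is not the OpenMPI or FooPar code: it is a plain-Scala model in which `linear` stands for a root that combines the incoming values one after another, and `tree` for a pairwise scheme whose rounds could run concurrently. Both names are hypothetical, and the counts are combining steps or rounds, not actual MPI calls.

```scala
// Counting sequential steps of a linear versus a tree-shaped reduction.
object ReductionSketch {
  // Linear: the root folds in the p values one by one (p - 1 sequential steps).
  def linear[A](values: Vector[A])(op: (A, A) => A): (A, Int) =
    (values.reduceLeft(op), values.length - 1)

  // Tree: neighbouring partial results are combined pairwise in each round.
  // The pairs of one round are independent, giving ceil(log2 p) rounds.
  def tree[A](values: Vector[A])(op: (A, A) => A): (A, Int) = {
    @annotation.tailrec
    def loop(cur: Vector[A], rounds: Int): (A, Int) =
      if (cur.length == 1) (cur.head, rounds)
      else {
        val next = cur.grouped(2).map {
          case Vector(x, y) => op(x, y)
          case Vector(x)    => x // odd element is carried into the next round
        }.toVector
        loop(next, rounds + 1)
      }
    loop(values, 0)
  }
}
```

Both variants expect a non-empty input, and because only neighbours are combined, the tree version relies on associativity of `op` but not on commutativity.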

For matrix size $n = 40000$ and the largest number of cores possible (i.e. $p = 512$)
Algorithm 2 achieves 4.84 TFlop/s, corresponding to 88.8% efficiency w.r.t. the theoretical peak
performance (i.e. 93.7% of the empirically achievable peak performance) of Carver. The
C-version performs only slightly better. Note that the stronger efficiency drop (when compared
to Horseshoe-6 results for smaller matrices) is due to the high performing math libraries; the
absolute performance is still better by a factor of approximately 2.2. We conclude that the
computation and communication overhead of using FooPar is negligible for practical purposes.
While keeping the advantages of higher-level constructs, we manage to keep the efficiency very
high. This result is in line with the isoefficiency analysis of FooPar in Section 4.

Results on Horseshoe-6: On Horseshoe-6 we observed that the different backends lead to
rather different efficiencies. When using the unmodified OpenMPI as a communication backend,
a performance drop is seen, as expected, due to the reasons mentioned above. MPJ-Express also
uses an unnecessary $\Theta(p)$ reduction (FastMPJ is closed source). However, if FooPar is not
used in an HPC setting and efficiency is not the main objective (as in a heterogeneous
system or a cloud environment), the advantages of "slower" backends (like running in daemon
mode) might pay off.

7 Conclusions 

We introduced FooPar, a functional and object-oriented framework that combines two orthog- 
onal scalabilities, namely the scalability as seen from the perspective of the Scala programming 
language and the scalability as seen from the HPC perspective. FooPar allows for isoefficiency 
analyses of algorithms such that theoretical scalability behavior can be shown. We presented 
parallel solutions in FooPar for the all-pairs-shortest-paths problem and for matrix-matrix 
multiplication. For the latter we supported the theoretical finding with empirical tests that 
reached close-to-optimal performance w.r.t. the theoretical peak performance on 512 cores.

Figure 5: Efficiency results for matrix-matrix multiplication (size $n \times n$) with Grid
Abstraction; x-axis: number of cores used; the value for $n$ and the communication backend
employed are given in the legend. Left: results on Carver, Right: results on Horseshoe-6;
efficiency is given relative to empirical peak performance on one core (see text).